Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added streams to CSV reader and writer api #14340

Merged
merged 12 commits into from
Nov 14, 2023

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Oct 27, 2023

Description

This PR contributes to #13744.
-Added stream parameters to public APIs
cudf::io::read_csv
cudf::io::write_csv
-Added stream gtests

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 27, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Oct 27, 2023
@shrshi shrshi added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Oct 27, 2023
@shrshi
Copy link
Contributor Author

shrshi commented Oct 27, 2023

/ok to test

@shrshi shrshi marked this pull request as ready for review October 30, 2023 15:44
@shrshi shrshi requested a review from a team as a code owner October 30, 2023 15:44
@shrshi
Copy link
Contributor Author

shrshi commented Oct 31, 2023

/ok to test

@karthikeyann
Copy link
Contributor

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean API change.
I think we can hit a few more code paths in the tests :)

TEST_F(CSVTest, CSVReader)
{
constexpr auto num_rows = 10;
auto int8_values = random_values<int8_t>(num_rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this the same situation as with ORC, where we don't really need the random values.

Comment on lines 89 to 98
.dtypes({cudf::data_type{cudf::type_id::INT8},
cudf::data_type{cudf::type_id::INT16},
cudf::data_type{cudf::type_id::INT32},
cudf::data_type{cudf::type_id::INT64},
cudf::data_type{cudf::type_id::UINT8},
cudf::data_type{cudf::type_id::UINT16},
cudf::data_type{cudf::type_id::UINT32},
cudf::data_type{cudf::type_id::UINT64},
cudf::data_type{cudf::type_id::FLOAT32},
cudf::data_type{cudf::type_id::FLOAT64}});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't pass the data types the reader will launch an extra kernel to infer them, so we should not include this option :)

Comment on lines 63 to 69
auto int16_values = random_values<int16_t>(num_rows);
auto int32_values = random_values<int32_t>(num_rows);
auto int64_values = random_values<int64_t>(num_rows);
auto uint8_values = random_values<uint8_t>(num_rows);
auto uint16_values = random_values<uint16_t>(num_rows);
auto uint32_values = random_values<uint32_t>(num_rows);
auto uint64_values = random_values<uint64_t>(num_rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this can all be a single column, the code path is the same

auto uint32_values = random_values<uint32_t>(num_rows);
auto uint64_values = random_values<uint64_t>(num_rows);
auto float32_values = random_values<float>(num_rows);
auto float64_values = random_values<double>(num_rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add a string column, there are additional thrust calls that post-process quotes (IIRC).

@shrshi
Copy link
Contributor Author

shrshi commented Nov 13, 2023

/ok to test

@shrshi shrshi requested a review from vuule November 13, 2023 20:52
@shrshi
Copy link
Contributor Author

shrshi commented Nov 13, 2023

Thank you for the feedback, @vuule. Changing the tests did reveal more code paths creating string scalars using the default stream, which are now fixed.

cpp/src/io/csv/writer_impl.cu Show resolved Hide resolved
cpp/src/io/csv/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/csv/writer_impl.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Nov 13, 2023

/ok to test

@PointKernel
Copy link
Member

@shrshi you may want to resolve the merge conflicts in test CMake first to unblock CI.

@shrshi
Copy link
Contributor Author

shrshi commented Nov 13, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for tracking down the default stream use!
Couple more tiny changes

cpp/src/io/csv/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/csv/writer_impl.cu Outdated Show resolved Hide resolved
cpp/tests/streams/io/csv_test.cpp Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Nov 14, 2023

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Nov 14, 2023

/ok to test

Comment on lines +806 to +807
wrapped = cudf::make_strings_column(
d_chars, d_offsets, d_bitmask, null_count, cudf::test::get_default_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this include changes from other PR? The additional stream parameter here and below seems unrelated to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that CSV stream test is exercising this code path while no prior stream test was, so before this PR we didn't observe that there was a missing stream argument here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this change is needed for the string column in the tests - make_strings_column is called with the default stream otherwise.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks!

@shrshi
Copy link
Contributor Author

shrshi commented Nov 14, 2023

/merge

@rapids-bot rapids-bot bot merged commit 27b052d into rapidsai:branch-23.12 Nov 14, 2023
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants